TODO
Initial exploration: Cross Correlation
Our first approach to this DAP looked at the relationship between the indicator and the signal more generally. We first used cross correlation analysis on the time series to identify the relationship between indicators and cases across a time period. For two time series \(y, x \in \mathbb{R}^T\), cross-correlation is defined as:
\[\max_{i} \operatorname{Corr}(y_{i+1, \cdots, T}, x_{1, \cdots, T-i}),\] that is, the maximum Pearson correlation between the two series over all lags \(i\), where one series is shifted by \(i\) time steps relative to the other.
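The definition above can be sketched in a few lines of base R. This is a minimal illustration rather than the code used in the analysis: the function name `max_cross_correlation` and the search bound `max_lag` are assumptions, since the text does not state the range of lags considered.

```r
# Maximum lagged Pearson correlation between two equal-length series.
# max_lag is an assumed search bound, not a value taken from the analysis.
max_cross_correlation <- function(y, x, max_lag = 14) {
  stopifnot(length(y) == length(x), max_lag < length(x) - 2)
  n <- length(x)
  lags <- 0:max_lag
  # For lag i, pair y[(i + 1):n] with x[1:(n - i)], as in the definition.
  corrs <- sapply(lags, function(i) {
    cor(y[(i + 1):n], x[1:(n - i)], use = "complete.obs")
  })
  best <- which.max(corrs)
  list(lag = lags[best], correlation = corrs[best])
}

# Toy example: y is x delayed by 3 days, so the best lag should be 3.
set.seed(1)
x <- cumsum(rnorm(60))
y <- c(rep(x[1], 3), x[1:57])
result <- max_cross_correlation(y, x)
```

The returned `lag` plays the role of the optimal lag mentioned in the text, and `correlation` is the cross-correlation value at that lag.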
We calculated the cross-correlation and the optimal lag in each county. An example of this data over all observed counties for the Drs Visits indicator signal:
In order to ascertain leading-ness of an indicator, we want to determine whether the indicator began to rise significantly before cases began to rise significantly.
As a core component of this analysis, we need a methodology to accurately identify significant rises, given a single time series of an indicator. This is non-trivial, since the data is quite noisy at the county level, and clean rises and drops are rare.
Starting at the Peak
At first, we experimented with finding the peak of a signal in a given time period and identifying the closest local minimum that precedes the peak. However, this is not always the point at which the signal actually begins to rise (the search can get caught in a shallow local minimum), and it does not guarantee that the rise is lengthy or steep. This method also picks only one rise period per signal per county for the given time period, which isn’t always reflective of the signal’s actual behavior.
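A small sketch of this peak-based idea, and of the shallow-minimum problem it runs into, might look as follows. The function name `rise_start_before_peak` and the toy series are hypothetical and are not taken from the LeadingIndicatorTools package.

```r
# Find the peak of a series and the closest local minimum before it.
rise_start_before_peak <- function(values) {
  peak <- which.max(values)
  if (peak < 3) return(list(peak = peak, start = 1))
  # Interior points before the peak that are lower than both neighbours.
  idx <- 2:(peak - 1)
  is_min <- values[idx] < values[idx - 1] & values[idx] < values[idx + 1]
  minima <- idx[is_min]
  start <- if (length(minima) > 0) max(minima) else 1
  list(peak = peak, start = start)
}

# Toy series: the true trough is at position 4, but a shallow dip at
# position 6 sits closer to the peak at position 8.
signal <- c(5, 3, 4, 2, 2.5, 2.2, 6, 9, 7)
res <- rise_start_before_peak(signal)
```

On this toy series the method selects the shallow dip at position 6 rather than the deeper trough at position 4, which is the failure mode described above.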
Best Fit Line
One option we tried was calculating a line of best fit for the signal over fixed-length windows within a larger time period. For example, we calculated a line of best fit for every 21-day window within a 3-month period and chose the window with the highest slope as the most significant rise period in that county for that signal.
Example plot
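The windowed best-fit approach can be sketched with `lm()` from base R. The function name `steepest_window` and the toy data are illustrative assumptions; only the 21-day window length comes from the text.

```r
# Slope of a least-squares line in every fixed-length window of a series,
# returning the window with the largest slope.
steepest_window <- function(values, window = 21) {
  stopifnot(length(values) >= window)
  starts <- 1:(length(values) - window + 1)
  days <- seq_len(window)
  slopes <- sapply(starts, function(s) {
    y <- values[s:(s + window - 1)]
    unname(coef(lm(y ~ days))[2])
  })
  best <- which.max(slopes)
  list(start = starts[best],
       end = starts[best] + window - 1,
       slope = slopes[best])
}

# Toy 90-day series: decline, then a 21-day rise of 2 per day, then decline.
toy <- c(41 - (1:30), 10 + 2 * (1:21), 52 - (1:39))
fit <- steepest_window(toy)
```

Because the window length is fixed, this sketch always reports exactly one window per series, even when the series contains several rises or none worth reporting.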
Estimated Derivative
We then tried several derivative estimation methods to identify periods where the estimated derivative at each point is over a certain threshold.
Example plot
We saw that smoothing the signal first using smoothing splines (in addition to the 7-day average smoothing already applied to the data, e.g. 7-day average CLI) and then using the derivative method produced the best results. Tweaking this method with some other decision rules gave us our best outcome for finding periods of significant rise.
Final criteria for rise periods: A period is a significant rise in a smoothed signal if
First derivative at each point is > 0 - this means the signal is in fact rising on every day
Period is > a certain number of days (for this analysis we used TODO) - this means the rise is not spurious
Each first derivative is > a certain % of the other derivatives in the time period (for this analysis we set this to 0%, effectively not using this parameter) - if set above 0, this requires the rise to be significant for this county, but it also ties the decision to the specific time period we are looking at, so the identified rise points can change with the time period.
Magnitude of increase from start to end of period is > a certain threshold (for this analysis we used TODO) - this is another way to make sure that the rise is significant, not just a slight uptick in cases
Finally, we take the point at the beginning of each rise period as the best estimation of a point of inflection where a signal begins to rise significantly, so we can address the question: Does the beginning of a rise in the indicator come before the beginning of a rise in cases?
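The criteria above can be sketched with `smooth.spline()` from base R. This is an illustration under stated assumptions, not the package's implementation: the function name `find_rise_starts`, the smoothing parameter `spar`, and the default values of `min_days` and `min_increase` are placeholders, since the thresholds used in the analysis are not given here. The percentile criterion is omitted because it was set to 0%.

```r
# Flag the first day of each sustained rise in a noisy daily series.
# min_days, min_increase and spar are placeholders, not the analysis values.
find_rise_starts <- function(values, min_days = 7, min_increase = 1,
                             spar = 0.5) {
  days <- seq_along(values)
  fit <- smooth.spline(days, values, spar = spar)
  smoothed <- predict(fit, days)$y
  deriv1 <- predict(fit, days, deriv = 1)$y

  # Consecutive runs of days with a positive first derivative.
  runs <- rle(deriv1 > 0)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1

  rise_point <- integer(length(values))
  for (k in which(runs$values)) {
    long_enough <- runs$lengths[k] > min_days
    big_enough <- smoothed[ends[k]] - smoothed[starts[k]] > min_increase
    if (long_enough && big_enough) rise_point[starts[k]] <- 1L
  }
  rise_point
}

# Toy series: a logistic rise centred on day 45 with a small wiggle on top.
days <- 1:90
toy <- 5 + 20 / (1 + exp(-(days - 45) / 5)) + 0.2 * sin(days)
rise <- find_rise_starts(toy)
which(rise == 1L)
```

The output uses the same 0/1 encoding as a rise point column, with a 1 on the first day of each qualifying rise period.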
In our analysis, we look at this on a county-by-county basis, for specific time periods during the pandemic, like the “Summer wave” or the “Fall wave.” TODO is this still true?
In our analysis, we include all counties that have greater than 2000 cases (a little over 20 cases a day for a 3 month window), 80 days of indicator data for a 3 month window, and do not have zero or negative values for either cases or the indicator. TODO is this still true?
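As a rough sketch, the inclusion criteria could be applied to a long-format data frame like this. The function name `eligible_counties` is hypothetical, and the sketch assumes one row per county-day with the columns `geo_value`, `case_value` and `ind_value`, and that the case threshold applies to the sum of `case_value` over the window.

```r
# Keep counties that satisfy the inclusion criteria described above.
# Assumes one row per county-day with geo_value, case_value, ind_value.
eligible_counties <- function(df, min_cases = 2000, min_days = 80) {
  by_county <- split(df, df$geo_value)
  keep <- vapply(by_county, function(d) {
    enough_cases <- sum(d$case_value, na.rm = TRUE) > min_cases
    enough_days <- sum(!is.na(d$ind_value)) >= min_days
    all_positive <- all(d$case_value > 0, na.rm = TRUE) &&
      all(d$ind_value > 0, na.rm = TRUE)
    enough_cases && enough_days && all_positive
  }, logical(1))
  names(keep)[keep]
}

# Toy data: three counties over 92 days (values are made up).
toy <- data.frame(
  geo_value = rep(c("01003", "01005", "01007"), each = 92),
  case_value = rep(c(30, 5, 30), each = 92),
  ind_value = 2
)
toy$ind_value[toy$geo_value == "01007"][10] <- 0  # one zero indicator value
kept <- eligible_counties(toy)
```

In the toy data, only the first county passes: the second has too few cases and the third has a zero indicator value.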
TODO
In this section, we describe our pipeline for processing, plotting and analyzing the data using the methodology described above.
As an example, we’ll use our Dr Visits % CLI as our indicator, and the summer as our time period. We use our LeadingIndicatorTools package for all our main functions.
drs_visits_prepared_summer = get_and_parse_signals("2020-06-01", "2020-8-31", "doctor-visits", "smoothed_adj_cli", 2000, 80)
## Warning: The `...` argument of `group_keys()` is deprecated as of dplyr 1.0.0.
## Please `group_by()` first
drs_summer = get_increase_points(drs_visits_prepared_summer$cases, drs_visits_prepared_summer$indicator)
plot_signals(drs_summer, "01003", smooth_and_show_increase_point=FALSE, "Drs Visits")
In the respective rise point columns, the day is marked with a 1 if it is found to be a rise point for that signal. We can see here that there is a rise point for Drs Visits on 6/18 and for cases on 6/26.
drs_summer[1]
## [[1]]
## time_value geo_value case_value ind_value case_rise_point
## 1 2020-06-01 01003 2.285714 2.591397 1
## 2 2020-06-02 01003 2.142857 2.057542 0
## 3 2020-06-03 01003 1.571429 1.621858 0
## 4 2020-06-04 01003 1.714286 1.472914 0
## 5 2020-06-05 01003 2.000000 1.672025 0
## 6 2020-06-06 01003 3.000000 1.778825 0
## 7 2020-06-07 01003 3.571429 1.885602 0
## 8 2020-06-08 01003 4.000000 2.008342 0
## 9 2020-06-09 01003 4.714286 2.300228 0
## 10 2020-06-10 01003 5.571429 2.182529 0
## 11 2020-06-11 01003 7.142857 1.955341 0
## 12 2020-06-12 01003 8.142857 1.869405 0
## 13 2020-06-13 01003 8.142857 1.710156 0
## 14 2020-06-14 01003 7.285714 1.601059 0
## 15 2020-06-15 01003 9.000000 1.517344 0
## 16 2020-06-16 01003 9.142857 1.332368 0
## 17 2020-06-17 01003 8.714286 1.107573 0
## 18 2020-06-18 01003 8.285714 1.351734 0
## 19 2020-06-19 01003 8.571429 1.745868 0
## 20 2020-06-20 01003 8.428571 2.264708 0
## 21 2020-06-21 01003 9.428571 2.834244 0
## 22 2020-06-22 01003 7.714286 3.241748 0
## 23 2020-06-23 01003 8.714286 3.510611 0
## 24 2020-06-24 01003 10.285714 3.674556 0
## 25 2020-06-25 01003 10.857143 3.581878 0
## 26 2020-06-26 01003 14.571429 3.290597 0
## 27 2020-06-27 01003 19.285714 3.585068 0
## 28 2020-06-28 01003 20.714286 3.962742 0
## 29 2020-06-29 01003 29.428571 4.295182 0
## 30 2020-06-30 01003 32.857143 4.992063 0
## 31 2020-07-01 01003 34.142857 5.485732 0
## 32 2020-07-02 01003 39.142857 5.911132 0
## 33 2020-07-03 01003 47.142857 6.084344 0
## 34 2020-07-04 01003 44.000000 6.427596 0
## 35 2020-07-05 01003 43.714286 6.822692 0
## 36 2020-07-06 01003 38.285714 7.248379 0
## 37 2020-07-07 01003 45.285714 8.381805 0
## 38 2020-07-08 01003 50.428571 8.751762 0
## 39 2020-07-09 01003 54.285714 8.718551 0
## 40 2020-07-10 01003 48.857143 8.918230 0
## 41 2020-07-11 01003 51.571429 9.440457 0
## 42 2020-07-12 01003 59.000000 9.903577 0
## 43 2020-07-13 01003 64.000000 10.334842 0
## 44 2020-07-14 01003 59.571429 8.921196 0
## 45 2020-07-15 01003 66.000000 7.922708 0
## 46 2020-07-16 01003 66.857143 7.748389 0
## 47 2020-07-17 01003 71.714286 7.546518 0
## 48 2020-07-18 01003 85.000000 7.418840 0
## 49 2020-07-19 01003 91.857143 7.416810 0
## 50 2020-07-20 01003 93.428571 7.473701 0
## 51 2020-07-21 01003 98.285714 7.966899 0
## 52 2020-07-22 01003 96.857143 7.355723 0
## 53 2020-07-23 01003 123.142857 6.977930 0
## 54 2020-07-24 01003 117.714286 6.833789 0
## 55 2020-07-25 01003 120.428571 7.083536 0
## 56 2020-07-26 01003 110.142857 7.285879 0
## 57 2020-07-27 01003 117.571429 7.476056 0
## 58 2020-07-28 01003 104.714286 7.393050 0
## 59 2020-07-29 01003 91.285714 6.987763 0
## 60 2020-07-30 01003 81.000000 6.626541 0
## 61 2020-07-31 01003 84.000000 6.682756 0
## 62 2020-08-01 01003 68.571429 6.462142 0
## 63 2020-08-02 01003 73.571429 6.311417 0
## 64 2020-08-03 01003 61.285714 6.214892 0
## 65 2020-08-04 01003 69.285714 6.607147 0
## 66 2020-08-05 01003 77.857143 6.786898 0
## 67 2020-08-06 01003 58.571429 6.640317 0
## 68 2020-08-07 01003 57.571429 6.277139 0
## 69 2020-08-08 01003 66.285714 5.670363 0
## 70 2020-08-09 01003 54.714286 5.093669 0
## 71 2020-08-10 01003 64.142857 4.641709 0
## 72 2020-08-11 01003 59.428571 4.583945 0
## 73 2020-08-12 01003 56.571429 4.629656 0
## 74 2020-08-13 01003 53.571429 4.489354 0
## 75 2020-08-14 01003 50.857143 4.255767 0
## 76 2020-08-15 01003 43.285714 4.266932 0
## 77 2020-08-16 01003 48.857143 4.282707 0
## 78 2020-08-17 01003 35.142857 4.287946 0
## 79 2020-08-18 01003 34.428571 3.873270 0
## 80 2020-08-19 01003 32.285714 3.521795 0
## 81 2020-08-20 01003 31.714286 3.491025 0
## 82 2020-08-21 01003 27.714286 3.445380 0
## 83 2020-08-22 01003 29.428571 3.262581 0
## 84 2020-08-23 01003 28.428571 3.155857 0
## 85 2020-08-24 01003 29.571429 3.075220 0
## 86 2020-08-25 01003 30.428571 3.269225 0
## 87 2020-08-26 01003 37.571429 3.194299 0
## 88 2020-08-27 01003 39.428571 2.938526 0
## 89 2020-08-28 01003 41.857143 2.783125 0
## 90 2020-08-29 01003 44.142857 3.147181 0
## 91 2020-08-30 01003 54.000000 3.435437 0
## 92 2020-08-31 01003 54.000000 3.613825 0
## indicator_rise_point
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
## 11 0
## 12 0
## 13 0
## 14 0
## 15 1
## 16 0
## 17 0
## 18 0
## 19 0
## 20 0
## 21 0
## 22 0
## 23 0
## 24 0
## 25 0
## 26 0
## 27 0
## 28 0
## 29 0
## 30 0
## 31 0
## 32 0
## 33 0
## 34 0
## 35 0
## 36 0
## 37 0
## 38 0
## 39 0
## 40 0
## 41 0
## 42 0
## 43 0
## 44 0
## 45 0
## 46 0
## 47 0
## 48 0
## 49 0
## 50 0
## 51 0
## 52 0
## 53 0
## 54 0
## 55 0
## 56 0
## 57 0
## 58 0
## 59 0
## 60 0
## 61 0
## 62 0
## 63 0
## 64 0
## 65 0
## 66 0
## 67 0
## 68 0
## 69 0
## 70 0
## 71 0
## 72 0
## 73 0
## 74 0
## 75 0
## 76 0
## 77 0
## 78 0
## 79 0
## 80 0
## 81 0
## 82 0
## 83 0
## 84 0
## 85 0
## 86 0
## 87 0
## 88 0
## 89 0
## 90 0
## 91 0
## 92 0
We can see that Drs Visits begins to rise before cases rise. TODO I think we need to tweak our rise point method a bit so we don’t have these “double counting” points on a rise.
plot_signals(drs_summer, "01003", smooth_and_show_increase_point=TRUE, "Drs Visits")
TODO on this whole section. Need to rework and add in Vishnu’s analysis.
## Warning: Some inputs were not uniquely matched; returning only the first match
## in each case.
##### Fall Frequencies
We can also look at the distribution of the number of days by which Doctor Visits’ rises lead case rises in successful counties.
We can also look at the distribution of the number of days by which Change Healthcare rises lead case rises in successful counties.
##### Summer Frequencies
We can also look at the distribution of the number of days by which Change Healthcare rises lead case rises in successful counties.
We can also look at the distribution of the number of days by which Indicator Combination rises lead case rises in successful counties.
##### Summer Frequencies
We can also look at the distribution of the number of days by which Change Healthcare rises lead case rises in successful counties.
Note that the counties shown and counted as “successful” here met the following criteria: All indicator rise points were followed by a case rise point within 3 to 14 days (and all case rise points were preceded by an indicator rise point within the same time period). This means that the displayed counties almost always have only one rise point per signal and case (the more rise points the harder it is to meet the criteria).
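The success criteria above can be written as a short check on the rise points of one county. The function name `is_successful` is hypothetical, and treating a county with no rise points in either series as unsuccessful is an assumption, since the text does not cover that case.

```r
# Check whether a county meets the "successful" criteria described above.
# ind_days and case_days are the days (indices or Dates) of rise points.
is_successful <- function(ind_days, case_days, min_lead = 3, max_lead = 14) {
  # Assumption: counties with no rise points in either series do not count.
  if (length(ind_days) == 0 || length(case_days) == 0) return(FALSE)
  # lead[i, j] = days by which indicator rise i precedes case rise j.
  lead <- outer(as.numeric(ind_days), as.numeric(case_days),
                function(ind, cs) cs - ind)
  in_window <- lead >= min_lead & lead <= max_lead
  # Every indicator rise needs a matching case rise, and vice versa.
  all(rowSums(in_window) > 0) && all(colSums(in_window) > 0)
}

# An indicator rise on 6/18 followed by a case rise on 6/26 (8 days later).
ok <- is_successful(as.Date("2020-06-18"), as.Date("2020-06-26"))
# A second indicator rise with no case rise after it breaks the criteria.
extra <- is_successful(c(10, 40), 18)
# A 20-day lead falls outside the 3 to 14 day window.
late <- is_successful(10, 30)
```

Because every rise point in both series must find a partner, each additional rise point makes the criteria harder to meet, which matches the observation that successful counties usually have a single rise point per series.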
Recall and precision.